Feedback should be sent to goran.milovanovic@datakolektiv.com.
These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.


What do we want to do today?
In this session, we continue to develop the Binomial Logistic Regression model and introduce ROC (Receiver Operating Characteristic) analysis for classification problems. From ROC analysis, we will learn a lot about metrics for evaluating classification models: the True Positive, False Positive, True Negative, and False Negative rates; Precision and Recall; the F1 score. We will learn how to plot the model's ROC curve and analyse it to determine the optimal decision threshold for any given Binomial Logistic Regression model.
We will optimize the Binomial Logistic model directly, using an optimization algorithm from the SciPy module, which completes our theoretical study of this problem.
We then introduce additional model indicators that take model complexity into account, such as the Akaike Information Criterion (AIC). Finally, we regularize the Binomial Logistic Regression model and learn how to perform Binomial Logistic Regression in scikit-learn.
### --- Setup - importing the libraries
# - suppress those annoying 'FutureWarning's
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# - data
import numpy as np
import pandas as pd
# - os
import os
# - ml
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
# - visualization
import matplotlib.pyplot as plt
import seaborn as sns
# - parameters
%matplotlib inline
pd.options.mode.chained_assignment = None # default='warn'
sns.set_theme()
# - rng
rng = np.random.default_rng()
# - plots
plt.rc("figure", figsize=(8, 6))
plt.rc("font", size=14)
sns.set_theme(style='white')
# - directory tree
data_dir = os.path.join(os.getcwd(), '_data')
# - loading the dataset
# - GitHub: https://github.com/dijendersaini/Customer-Churn-Model/blob/master/churn_data.csv
# - place it in your _data/ directory
churn_data = pd.read_csv(os.path.join(data_dir, 'churn_data.csv'))
churn_data.head(10)
| | customerID | tenure | PhoneService | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 1 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | 34 | Yes | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | 2 | Yes | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | 45 | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | 2 | Yes | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| 5 | 9305-CDSKC | 8 | Yes | Month-to-month | Yes | Electronic check | 99.65 | 820.5 | Yes |
| 6 | 1452-KIOVK | 22 | Yes | Month-to-month | Yes | Credit card (automatic) | 89.10 | 1949.4 | No |
| 7 | 6713-OKOMC | 10 | No | Month-to-month | No | Mailed check | 29.75 | 301.9 | No |
| 8 | 7892-POOKP | 28 | Yes | Month-to-month | Yes | Electronic check | 104.80 | 3046.05 | Yes |
| 9 | 6388-TABGU | 62 | Yes | One year | No | Bank transfer (automatic) | 56.15 | 3487.95 | No |
churn_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   tenure            7043 non-null   int64
 2   PhoneService      7043 non-null   object
 3   Contract          7043 non-null   object
 4   PaperlessBilling  7043 non-null   object
 5   PaymentMethod     7043 non-null   object
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object
 8   Churn             7043 non-null   object
dtypes: float64(1), int64(1), object(7)
memory usage: 495.3+ KB
# - use .replace method to replace empty strings with NaN values
churn_data = churn_data.replace(r'^\s*$', np.nan, regex=True)
# - we drop all the entries with missing values
churn_data = churn_data.dropna().reset_index(drop=True)
# - notice that 'TotalCharges' values are non-numeric type, but they should be
# - this is due to the empty string values that were previously present
# - we convert them to numeric type
churn_data['TotalCharges'] = churn_data['TotalCharges'].astype('float')
churn_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7032 non-null   object
 1   tenure            7032 non-null   int64
 2   PhoneService      7032 non-null   object
 3   Contract          7032 non-null   object
 4   PaperlessBilling  7032 non-null   object
 5   PaymentMethod    7032 non-null   object
 6   MonthlyCharges    7032 non-null   float64
 7   TotalCharges      7032 non-null   float64
 8   Churn             7032 non-null   object
dtypes: float64(2), int64(1), object(6)
memory usage: 494.6+ KB
### --- Preparing the model frame
# - extracting 'Churn' and all the numerical features columns
model_frame = churn_data[['Churn', 'tenure', 'MonthlyCharges', 'TotalCharges']]
model_frame.head()
| | Churn | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|---|
| 0 | No | 1 | 29.85 | 29.85 |
| 1 | No | 34 | 56.95 | 1889.50 |
| 2 | Yes | 2 | 53.85 | 108.15 |
| 3 | No | 45 | 42.30 | 1840.75 |
| 4 | Yes | 2 | 70.70 | 151.65 |
# - encoding 'Churn' values to binary values
model_frame['Churn'] = model_frame['Churn'].apply(lambda x: int(x == 'Yes'))
model_frame.head()
| | Churn | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|---|
| 0 | 0 | 1 | 29.85 | 29.85 |
| 1 | 0 | 34 | 56.95 | 1889.50 |
| 2 | 1 | 2 | 53.85 | 108.15 |
| 3 | 0 | 45 | 42.30 | 1840.75 |
| 4 | 1 | 2 | 70.70 | 151.65 |
predictors = model_frame.columns.drop('Churn')
predictors
Index(['tenure', 'MonthlyCharges', 'TotalCharges'], dtype='object')
# --- Composing the formula of the model
# - right side of the formula
formula = ' + '.join(predictors)
# - left side of the formula
formula = 'Churn ~ ' + formula
formula
'Churn ~ tenure + MonthlyCharges + TotalCharges'
# - fitting BLR model to the data
binomial_linear_model = smf.logit(formula=formula, data=model_frame).fit()
Optimization terminated successfully.
Current function value: 0.453372
Iterations 7
binomial_linear_model.summary()
| Dep. Variable: | Churn | No. Observations: | 7032 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 7028 |
| Method: | MLE | Df Model: | 3 |
| Date: | Tue, 02 May 2023 | Pseudo R-squ.: | 0.2170 |
| Time: | 11:30:29 | Log-Likelihood: | -3188.1 |
| converged: | True | LL-Null: | -4071.7 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.5988 | 0.117 | -13.628 | 0.000 | -1.829 | -1.369 |
| tenure | -0.0671 | 0.005 | -12.297 | 0.000 | -0.078 | -0.056 |
| MonthlyCharges | 0.0302 | 0.002 | 17.585 | 0.000 | 0.027 | 0.034 |
| TotalCharges | 0.0001 | 6.14e-05 | 2.361 | 0.018 | 2.47e-05 | 0.000 |
# - predicting the probabilities
probabilities = binomial_linear_model.predict()
probabilities[:10]
array([0.31861382, 0.13162129, 0.47723817, 0.04417324, 0.60445587,
0.72962176, 0.47459352, 0.2095354 , 0.53216748, 0.02770232])
# - predicting binary labels, taking \sigma = 0.5
predictions = (probabilities > .5).astype('int')
predictions[:10]
array([0, 0, 0, 0, 1, 1, 0, 0, 1, 0])
# - observed vs. predicted labels
predictions_df = pd.DataFrame()
predictions_df['observation'] = model_frame['Churn']
predictions_df['prediction'] = predictions
predictions_df.head()
| | observation | prediction |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
True Positives (Hits): data say 1 and prediction is 1
# - model True Positive Rate (TPR, Hit rate)
tpr = (predictions_df['observation'] == 1)&(predictions_df['prediction'] == 1)
tpr = np.sum(tpr)/np.sum(predictions_df['observation'] == 1)
np.round(tpr, 4)
0.4425
False Positives (False Alarms): data say 0 and prediction is 1
# - model False Positive Rate (FPR, False Alarm rate)
fpr = (predictions_df['observation'] == 0)&(predictions_df['prediction'] == 1)
fpr = np.sum(fpr)/np.sum(predictions_df['observation'] == 0)
np.round(fpr, 4)
0.091
True Negatives (Correct Rejections): data say 0 and prediction is 0
# - model True Negative Rate (TNR, Correct Rejection rate)
tnr = (predictions_df['observation'] == 0)&(predictions_df['prediction'] == 0)
tnr = np.sum(tnr)/np.sum(predictions_df['observation'] == 0)
np.round(tnr, 4)
0.909
False Negatives (Misses): data say 1 and prediction is 0
# - model False Negative Rate (FNR, Miss rate)
fnr = (predictions_df['observation'] == 1)&(predictions_df['prediction'] == 0)
fnr = np.sum(fnr)/np.sum(predictions_df['observation'] == 1)
np.round(fnr, 4)
0.5575
True Positive Rate + False Negative Rate (or Hit + Miss)
tpr + fnr
1.0
False Positive Rate + True Negative Rate (or False Alarm + Correct Rejection)
fpr + tnr
1.0
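The four rates above can also be computed in one pass from the raw counts of the confusion matrix. A minimal sketch; the helper name `classification_rates` and the 0/1 arrays are illustrative, not part of the notebook:

```python
import numpy as np

def classification_rates(observed, predicted):
    # - raw counts of the four outcome types
    observed = np.asarray(observed)
    predicted = np.asarray(predicted)
    tp = np.sum((observed == 1) & (predicted == 1))
    fp = np.sum((observed == 0) & (predicted == 1))
    tn = np.sum((observed == 0) & (predicted == 0))
    fn = np.sum((observed == 1) & (predicted == 0))
    # - rates: normalize by the number of actual positives/negatives
    return {'tpr': tp/(tp + fn), 'fpr': fp/(fp + tn),
            'tnr': tn/(tn + fp), 'fnr': fn/(fn + tp)}
```

By construction, `tpr + fnr == 1` and `fpr + tnr == 1`, exactly as verified above.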
### --- Constructing the ROC curve of the model
# - calculating model metrics for varying decision criteria \sigma
dec_criterion = np.arange(.01, .99, step=.01)
observations = model_frame['Churn']
hits = []
fas = []
accuracies = []
for x in dec_criterion:
predictions = probabilities > x
accuracy = np.sum(observations == predictions)/len(observations)
hit = np.sum((observations == 1)&(predictions == 1))/np.sum(observations == 1)
fa = np.sum((observations == 0)&(predictions == 1))/np.sum(observations == 0)
accuracies.append(accuracy)
fas.append(fa)
hits.append(hit)
roc_frame = pd.DataFrame()
roc_frame['hit'] = hits
roc_frame['fa'] = fas
roc_frame['accuracy'] = accuracies
roc_frame['dec'] = dec_criterion
roc_frame.head()
| | hit | fa | accuracy | dec |
|---|---|---|---|---|
| 0 | 0.998930 | 0.935890 | 0.312571 | 0.01 |
| 1 | 0.993579 | 0.889793 | 0.344994 | 0.02 |
| 2 | 0.989834 | 0.847182 | 0.375284 | 0.03 |
| 3 | 0.987694 | 0.809413 | 0.402446 | 0.04 |
| 4 | 0.979668 | 0.771063 | 0.428470 | 0.05 |
# - difference between the hit and false alarm rates
roc_frame['diff'] = roc_frame['hit'] - roc_frame['fa']
roc_frame.head()
| | hit | fa | accuracy | dec | diff |
|---|---|---|---|---|---|
| 0 | 0.998930 | 0.935890 | 0.312571 | 0.01 | 0.063040 |
| 1 | 0.993579 | 0.889793 | 0.344994 | 0.02 | 0.103787 |
| 2 | 0.989834 | 0.847182 | 0.375284 | 0.03 | 0.142652 |
| 3 | 0.987694 | 0.809413 | 0.402446 | 0.04 | 0.178281 |
| 4 | 0.979668 | 0.771063 | 0.428470 | 0.05 | 0.208605 |
# - identifying the entry with the highest hit-false alarm rate difference
diff_argmax = roc_frame['diff'].argmax()
print(f'The optimal model is:\n{roc_frame.iloc[diff_argmax]}')
The optimal model is:
hit         0.784912
fa          0.316483
accuracy    0.710466
dec         0.250000
diff        0.468429
Name: 24, dtype: float64
# - plotting the ROC curve of the model
# - the point with the biggest hit-false alarm rate difference is marked
sns.lineplot(data=roc_frame, x='dec', y='dec', color='black')
sns.lineplot(data=roc_frame, x='fa', y='hit')
plt.scatter(x=roc_frame.loc[diff_argmax, 'fa'], y=roc_frame.loc[diff_argmax, 'hit'], c='r')
plt.text(x=roc_frame.loc[diff_argmax, 'fa']-.08, y=roc_frame.loc[diff_argmax, 'hit'], s='Here!')
sns.despine(offset=10, trim=True)
plt.xlabel('False Positive (False Alarm) Rate', fontsize=14)
plt.ylabel('True Positive (Hit) Rate', fontsize=14)
plt.title('ROC Analysis for the Binomial Regression Model', fontsize=16);
The Akaike Information Criterion (AIC) is a statistical measure used to evaluate the goodness-of-fit of a model. It is based on the principle of parsimony, which states that simpler models should be preferred over more complex ones, all else being equal.
The AIC is defined as follows:
$$AIC = -2\ln(\mathcal{L}) + 2k $$where $\mathcal{L}$ is the model likelihood and $k$ is the number of parameters in the model.
The AIC penalizes models with more parameters by adding the penalty term $2k$ to the deviance $-2\ln(\mathcal{L})$. This penalty grows with the number of parameters, and hence it discourages overfitting and encourages simpler models.
The AIC can be used to compare different models and select the best one based on their AIC values. The model with the lowest AIC value is preferred, as it strikes a good balance between goodness-of-fit and simplicity.
# - Akaike Information Criterion (AIC)
binomial_linear_model.aic
6384.226174634843
# - another way to compute AIC
model_loglike = binomial_linear_model.llf
model_loglike
# - k counts all estimated parameters: the three predictors plus the intercept
aic = -2*model_loglike + 2*(len(predictors) + 1)
aic
6384.226174634843
In Binomial Logistic Regression, the null model is a model with no predictor variables, meaning that it only includes an intercept term. The null model predicts the probability of the outcome variable based on its overall frequency in the sample, without taking any other variables into account.
Model Log-Likelihood
# - Log-likelihood of model
model_loglike = binomial_linear_model.llf
model_loglike
-3188.1130873174216
Null Model Log-Likelihood
# Value of the constant-only loglikelihood
null_loglike = binomial_linear_model.llnull
null_loglike
-4071.6775733255813
# - Comparison to the Null Model which follows the Chi-Square distribution
# - Likelihood ratio chi-squared statistic; G = -2*(llnull - llf)
dev_diff = binomial_linear_model.llr
dev_diff
1767.1289720163195
The Likelihood ratio Chi-Squared statistic (sometimes termed $G$) follows the $\chi^2$ distribution.
print(f'G: {-2*(null_loglike - model_loglike)}')
print(binomial_linear_model.llr_pvalue)
G: 1767.1289720163195
0.0
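The reported `llr_pvalue` can be reproduced from the $\chi^2$ survival function. A sketch, assuming SciPy is available and using the $G$ statistic and $Df\ Model = 3$ from the model above:

```python
from scipy.stats import chi2

# - likelihood ratio statistic and degrees of freedom from the model above
G = 1767.1289720163195
df = 3  # Df Model: the number of predictors
# - survival function: probability of a chi-square value at least this large
p_value = chi2.sf(G, df)
print(p_value)  # effectively zero for a statistic this large
```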
# - exponential of the parameters and AIC of the model using only numerical predictors (a reminder)
np.exp(binomial_linear_model.params)
Intercept         0.202133
tenure            0.935088
MonthlyCharges    1.030660
TotalCharges      1.000145
dtype: float64
binomial_linear_model.aic
6384.226174634843
### --- Preparing the dataset
# - dropping the 'customerID' column
model_frame = churn_data.drop(columns='customerID')
model_frame.head()
| | tenure | PhoneService | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 34 | Yes | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | 2 | Yes | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 45 | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 2 | Yes | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
model_frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   tenure            7032 non-null   int64
 1   PhoneService      7032 non-null   object
 2   Contract          7032 non-null   object
 3   PaperlessBilling  7032 non-null   object
 4   PaymentMethod     7032 non-null   object
 5   MonthlyCharges    7032 non-null   float64
 6   TotalCharges      7032 non-null   float64
 7   Churn             7032 non-null   object
dtypes: float64(2), int64(1), object(5)
memory usage: 439.6+ KB
# - encoding 'Churn' column to binary values
model_frame['Churn'] = model_frame['Churn'].apply(lambda x: int(x == 'Yes'))
model_frame.head()
| | tenure | PhoneService | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | 0 |
| 1 | 34 | Yes | One year | No | Mailed check | 56.95 | 1889.50 | 0 |
| 2 | 2 | Yes | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | 1 |
| 3 | 45 | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | 0 |
| 4 | 2 | Yes | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | 1 |
predictors = model_frame.columns.drop('Churn')
predictors
Index(['tenure', 'PhoneService', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges'],
dtype='object')
# --- Composing the formula of the model
# - right side of the formula
formula = ' + '.join(predictors)
# - left side of the formula
formula = 'Churn ~ ' + formula
formula
'Churn ~ tenure + PhoneService + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges'
# - fitting BLR model to the data
binomial_linear_model = smf.logit(formula=formula, data=model_frame).fit()
Optimization terminated successfully.
Current function value: 0.424667
Iterations 8
binomial_linear_model.summary()
| Dep. Variable: | Churn | No. Observations: | 7032 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 7021 |
| Method: | MLE | Df Model: | 10 |
| Date: | Tue, 02 May 2023 | Pseudo R-squ.: | 0.2666 |
| Time: | 11:30:32 | Log-Likelihood: | -2986.3 |
| converged: | True | LL-Null: | -4071.7 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.8446 | 0.176 | -4.794 | 0.000 | -1.190 | -0.499 |
| PhoneService[T.Yes] | -0.8419 | 0.122 | -6.926 | 0.000 | -1.080 | -0.604 |
| Contract[T.One year] | -0.9171 | 0.103 | -8.876 | 0.000 | -1.120 | -0.715 |
| Contract[T.Two year] | -1.8135 | 0.172 | -10.565 | 0.000 | -2.150 | -1.477 |
| PaperlessBilling[T.Yes] | 0.4230 | 0.073 | 5.812 | 0.000 | 0.280 | 0.566 |
| PaymentMethod[T.Credit card (automatic)] | -0.1006 | 0.112 | -0.895 | 0.371 | -0.321 | 0.120 |
| PaymentMethod[T.Electronic check] | 0.4126 | 0.093 | 4.453 | 0.000 | 0.231 | 0.594 |
| PaymentMethod[T.Mailed check] | -0.1194 | 0.113 | -1.061 | 0.289 | -0.340 | 0.101 |
| tenure | -0.0606 | 0.006 | -9.932 | 0.000 | -0.073 | -0.049 |
| MonthlyCharges | 0.0225 | 0.002 | 11.022 | 0.000 | 0.019 | 0.027 |
| TotalCharges | 0.0003 | 6.81e-05 | 4.205 | 0.000 | 0.000 | 0.000 |
# - exponentials of the new model parameters
np.exp(binomial_linear_model.params)
Intercept                                   0.429725
PhoneService[T.Yes]                         0.430907
Contract[T.One year]                        0.399695
Contract[T.Two year]                        0.163084
PaperlessBilling[T.Yes]                     1.526520
PaymentMethod[T.Credit card (automatic)]    0.904265
PaymentMethod[T.Electronic check]           1.510691
PaymentMethod[T.Mailed check]               0.887475
tenure                                      0.941170
MonthlyCharges                              1.022802
TotalCharges                                1.000286
dtype: float64
# - AIC of the new model
binomial_linear_model.aic
5994.512151127206
### --- Constructing the ROC curve of the model
# - calculating model metrics for varying decision criteria \sigma
dec_criterion = np.arange(.01, .99, step=.01)
probabilities = binomial_linear_model.predict()
observations = model_frame['Churn']
hits = []
fas = []
accuracies = []
for x in dec_criterion:
predictions = probabilities > x
accuracy = np.sum(observations == predictions)/len(observations)
hit = np.sum((observations == 1)&(predictions == 1))/np.sum(observations == 1)
fa = np.sum((observations == 0)&(predictions == 1))/np.sum(observations == 0)
accuracies.append(accuracy)
fas.append(fa)
hits.append(hit)
roc_frame = pd.DataFrame()
roc_frame['hit'] = hits
roc_frame['fa'] = fas
roc_frame['accuracy'] = accuracies
roc_frame['dec'] = dec_criterion
roc_frame['diff'] = roc_frame['hit'] - roc_frame['fa']
roc_frame.head()
| | hit | fa | accuracy | dec | diff |
|---|---|---|---|---|---|
| 0 | 0.996255 | 0.872942 | 0.358077 | 0.01 | 0.123313 |
| 1 | 0.990904 | 0.791788 | 0.416240 | 0.02 | 0.199117 |
| 2 | 0.986624 | 0.731551 | 0.459329 | 0.03 | 0.255072 |
| 3 | 0.983414 | 0.688166 | 0.490330 | 0.04 | 0.295248 |
| 4 | 0.979668 | 0.653883 | 0.514505 | 0.05 | 0.325785 |
# - identifying the entry with the highest hit-false alarm rate difference
diff_argmax = roc_frame['diff'].argmax()
print(f'The optimal model is:\n{roc_frame.iloc[diff_argmax]}')
The optimal model is:
hit         0.838416
fa          0.318420
accuracy    0.723265
dec         0.240000
diff        0.519997
Name: 23, dtype: float64
# - plotting the ROC curve of the model
# - the point with the biggest hit-false alarm rate difference is marked
sns.lineplot(data=roc_frame, x='dec', y='dec', color='black')
sns.lineplot(data=roc_frame, x='fa', y='hit')
plt.scatter(x=roc_frame.loc[diff_argmax, 'fa'], y=roc_frame.loc[diff_argmax, 'hit'], c='r')
plt.text(x=roc_frame.loc[diff_argmax, 'fa']-.08, y=roc_frame.loc[diff_argmax, 'hit'], s='Here!')
sns.despine(offset=10, trim=True)
plt.xlabel('False Positive (False Alarm) Rate', fontsize=14)
plt.ylabel('True Positive (Hit) Rate', fontsize=14)
plt.title('ROC Analysis for the Binomial Regression Model', fontsize=16);
The ROC plot can thus also be understood as a $1-Specificity$ vs $Sensitivity$ plot (because $FPR = 1 - Specificity$). In mathematical statistics, a Type I Error indicates a situation in which our statistical model predicts something which is not the case in the population (that is the $\alpha$ level in statistical analyses): this concept maps exactly onto our $FPR$, or False Alarm rate. On the other hand, Statistical Power is the probability with which a statistical model (or test) can successfully recover an occurrence in the population: and this is perfectly matched by our understanding of $TPR$, or Hit Rate. Thus, the ROC also plots Type I Error against Statistical Power.
Precision (or Positive Predictive Value (PPV)) and False Discovery Rate (FDR)
$$PPV = \frac{TP}{TP+FP}=1-FDR$$The classifier's Precision is the ratio of True Positives to the sum of True Positives and False Positives: the ratio of correct ("relevant") classifications to the number of positive classifications made.
$$FDR = \frac{FP}{FP+TP}=1-PPV$$The classifier's False Discovery Rate is the ratio of False Positives to the sum of True Positives and False Positives: the ratio of incorrect ("irrelevant") classifications to the number of positive classifications made.
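A quick count-based sketch of both quantities; the counts are made up for illustration:

```python
# - hypothetical counts: 80 true positives, 20 false positives
tp, fp = 80, 20
precision = tp / (tp + fp)  # PPV: correct positive classifications
fdr = fp / (fp + tp)        # FDR: incorrect positive classifications
print(precision, fdr)       # 0.8 0.2 -- PPV and FDR always sum to 1
```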
Accuracy and Balanced Accuracy
In cases of highly imbalanced classes in binary classification, Accuracy can give us a dangerously misleading result:
$$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$$Balanced Accuracy ($bACC$) can be used to correct for class imbalance:
$$bACC=\frac{TPR+TNR}{2}$$To understand how $bACC$ works, think of the following case: we have a dataset with 100 observations, of which 75 are of class $C_1$ and 25 of class $C_2$; hence, a model that always predicts $C_1$ and never $C_2$ must be accurate 75% of the time, right? However, its $bACC$ is only .5 (because its $TPR=1$ and $TNR=0$).
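The 75/25 example above can be checked directly; a sketch with a degenerate classifier that always predicts the majority class $C_1$ (coded here as the positive class, 1):

```python
import numpy as np

# - 75 observations of class C1 (coded 1, the positive class), 25 of class C2 (coded 0)
observed = np.array([1]*75 + [0]*25)
# - a degenerate classifier that always predicts C1
predicted = np.ones(100, dtype=int)
accuracy = np.mean(observed == predicted)
tpr = np.sum((observed == 1) & (predicted == 1)) / np.sum(observed == 1)  # 1.0
tnr = np.sum((observed == 0) & (predicted == 0)) / np.sum(observed == 0)  # 0.0
bacc = (tpr + tnr) / 2
print(accuracy, bacc)  # 0.75 0.5
```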
The $F_1$ score
This is traditionally used to assess how good a classifier is:
$$F_1 = 2\frac{Precision\cdot Recall}{Precision+Recall}$$and is also known as F-measure or balanced F-score.
Note.
We can thus use a more general $F$-score, $F_\beta$, where $\beta$ determines how many times more important recall is considered than precision:
$$F_\beta = (1+\beta^2)\frac{Precision\cdot Recall}{\beta^2 \cdot Precision+Recall}$$
Area Under the Curve (AUC)
This is probably the most frequently used indicator of model fit in classification. Given that the ideal classifier is found in the top left corner of the ROC plot, i.e. where $TPR=1$ and $FPR=0$, it is obvious that the best model in some comparison is the one with the greatest area under the ROC.
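Given the `(fa, hit)` pairs collected in `roc_frame` above, the AUC can be approximated by trapezoidal integration. A sketch; the helper name `roc_auc` is illustrative, and the chance diagonal serves as a sanity check:

```python
import numpy as np

def roc_auc(fa, hit):
    # - order the ROC points by the false alarm rate and anchor at (0,0) and (1,1)
    order = np.argsort(fa)
    x = np.concatenate(([0.0], np.asarray(fa)[order], [1.0]))
    y = np.concatenate(([0.0], np.asarray(hit)[order], [1.0]))
    # - trapezoidal rule: sum of slice widths times mean slice heights
    return np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)

# - sanity check: the chance diagonal has AUC = 0.5
grid = np.linspace(0, 1, 99)
print(round(roc_auc(grid, grid), 6))  # 0.5
```

For the fitted model it would be called as `roc_auc(roc_frame['fa'], roc_frame['hit'])`.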
# - ROC Analysis continued
# - identifying the entry with the highest hit-false alarm rate difference
diff_argmax = roc_frame['diff'].argmax()
optimal_model = roc_frame.iloc[diff_argmax]
print(f'The optimal model is:\n{optimal_model}')
The optimal model is:
hit         0.838416
fa          0.318420
accuracy    0.723265
dec         0.240000
diff        0.519997
Name: 23, dtype: float64
Balanced Accuracy to correct for class imbalance: $bACC=\frac{TPR+TNR}{2}$
#bACC
tpr = optimal_model['hit']
tnr = 1-optimal_model['fa']
bACC = (tpr+tnr)/2
print(f'The model bACC: {bACC:.3f}')
The model bACC: 0.760
Precision (or Positive Predictive Value (PPV)): $PPV = \frac{TP}{TP+FP}=1-FDR$
# Precision (approximated here from the rates; the exact PPV uses the TP and FP
# counts, and the two coincide only when the classes are balanced)
tpr = optimal_model['hit']
fpr = optimal_model['fa']
precision = tpr/(tpr+fpr)
print(f'The model Precision: {precision:.3f}')
The model Precision: 0.725
The model Recall is simply the Hit rate (TPR):
# Recall
recall = optimal_model['hit']
print(f'The model Recall: {recall:.3f}')
The model Recall: 0.838
The $F_1$ score
$$F_1 = 2\frac{Precision\cdot Recall}{Precision+Recall}$$
# F1 Score
f1 = 2*((precision*recall)/(precision+recall))
print(f'The model F1 score: {f1:.3f}')
The model F1 score: 0.777
sklearn: Predicting churn from numerical predictors
# - import scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
Predicting with numerical variables only:
### --- Preparing the variables
# - feature matrix
X = churn_data[['tenure', 'MonthlyCharges', 'TotalCharges']].values
# - target variable
y = churn_data['Churn'].apply(lambda x: int(x == 'Yes'))
### --- Fitting the logistic model to the numerical data
log_reg = LogisticRegression()
log_reg.fit(X, y)
LogisticRegression()
# - coefficients of the model
log_reg.coef_, log_reg.intercept_
(array([[-0.06711264, 0.03019993, 0.00014507]]), array([-1.59884834]))
# - model's accuracy
round(log_reg.score(X, y), 4)
0.785
# - confusion matrix for the given data
y_pred = log_reg.predict(X)
confusion_matrix(y, y_pred)
array([[4693, 470],
[1042, 827]])
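As a cross-check, the accuracy reported by `score()` can be recovered from this confusion matrix, whose diagonal holds the correct classifications:

```python
import numpy as np

# - confusion matrix printed above: rows = observed, columns = predicted
cm = np.array([[4693,  470],
               [1042,  827]])
# - correct classifications sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 4))  # 0.785
```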
sklearn: Predicting churn from all the predictors
churn_data.head()
| | customerID | tenure | PhoneService | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 1 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | 34 | Yes | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | 3668-QPYBK | 2 | Yes | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | 45 | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | 2 | Yes | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
### --- Composing the feature matrix
# - dropping all the non-numerical and non-binary categorical columns
X0 = churn_data.drop(columns=['customerID', 'Contract', 'PaymentMethod', 'Churn'])
# - encoding binary categorical features to binary values
X0['PaperlessBilling'] = X0['PaperlessBilling'].apply(lambda x: int(x == 'Yes'))
X0['PhoneService'] = X0['PhoneService'].apply(lambda x: int(x == 'Yes'))
X0.head()
| | tenure | PhoneService | PaperlessBilling | MonthlyCharges | TotalCharges |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 29.85 | 29.85 |
| 1 | 34 | 1 | 0 | 56.95 | 1889.50 |
| 2 | 2 | 1 | 1 | 53.85 | 108.15 |
| 3 | 45 | 0 | 0 | 42.30 | 1840.75 |
| 4 | 2 | 1 | 1 | 70.70 | 151.65 |
# - casting the data frame into a matrix
X0 = X0.values
X0
array([[1.0000e+00, 0.0000e+00, 1.0000e+00, 2.9850e+01, 2.9850e+01],
[3.4000e+01, 1.0000e+00, 0.0000e+00, 5.6950e+01, 1.8895e+03],
[2.0000e+00, 1.0000e+00, 1.0000e+00, 5.3850e+01, 1.0815e+02],
...,
[1.1000e+01, 0.0000e+00, 1.0000e+00, 2.9600e+01, 3.4645e+02],
[4.0000e+00, 1.0000e+00, 1.0000e+00, 7.4400e+01, 3.0660e+02],
[6.6000e+01, 1.0000e+00, 1.0000e+00, 1.0565e+02, 6.8445e+03]])
# - categories of the 'Contract' variable
churn_data['Contract'].unique()
array(['Month-to-month', 'One year', 'Two year'], dtype=object)
# - categories of the 'PaymentMethod' variable
churn_data['PaymentMethod'].unique()
array(['Electronic check', 'Mailed check', 'Bank transfer (automatic)',
'Credit card (automatic)'], dtype=object)
# - we want to recreate the previous statsmodels model that was using all the predictors
# - to achieve this we one-hot (dummy) encode non-binary categorical predictors
# - statsmodels chooses the first category in order of appearance in the dataset as the reference category
# - we pass the reference category manually as an argument to the OneHotEncoder
enc_contract = OneHotEncoder(drop=['Month-to-month'], sparse=False)
dummy_contract = enc_contract.fit_transform(churn_data['Contract'].values.reshape(-1, 1))
enc_payment = OneHotEncoder(drop=['Bank transfer (automatic)'], sparse=False)
dummy_payment = enc_payment.fit_transform(churn_data['PaymentMethod'].values.reshape(-1, 1))
# - concatenating values of the numerical predictors and encoded binary values with the encoded non-binary values
# - into a feature matrix
X = np.concatenate((X0, dummy_contract, dummy_payment), axis=-1)
display(X)
# - target variable; encoding to binary values
y = churn_data['Churn'].apply(lambda x: int(x == 'Yes'))
array([[ 1., 0., 1., ..., 0., 1., 0.],
[34., 1., 0., ..., 0., 0., 1.],
[ 2., 1., 1., ..., 0., 0., 1.],
...,
[11., 0., 1., ..., 0., 1., 0.],
[ 4., 1., 1., ..., 0., 0., 1.],
[66., 1., 1., ..., 0., 0., 0.]])
### --- Fitting the logistic model to all the data
log_reg = LogisticRegression(solver='newton-cg', penalty='none')
log_reg.fit(X, y)
LogisticRegression(penalty='none', solver='newton-cg')
# - model's accuracy
round(log_reg.score(X, y), 4)
0.7964
# - exponential of the model parameters
# - ordering corresponds to the ordering of the features in the feature matrix
np.exp(log_reg.coef_), np.exp(log_reg.intercept_)
(array([[0.9411695 , 0.43090698, 1.52652006, 1.0228018 , 1.00028639,
0.39969521, 0.16308371, 0.90426493, 1.51069097, 0.88747494]]),
array([0.42972508]))
# - confusion matrix for the given data
y_pred = log_reg.predict(X)
confusion_matrix(y, y_pred)
array([[4615, 548],
[ 884, 985]])
Say we have observed the following data: $HHTHTTHHHT$. Assume a candidate value of the parameter $p_H$: we can then compute the Likelihood function,
$\mathcal{L}(p_H|HHTHTTHHHT)$, exactly as we did before. Now, this is the general form of the Binomial Likelihood (where $Y$ stands for the observed data):
$$\mathcal{L}(p|Y) = p_1^y(1-p_1)^{n-y}$$where $y$ is the number of successes and $n$ the total number of observations. For each observed data point we then have
$$\mathcal{L}(p|y_i) = p_1^{y_i}(1-p_1)^{\bar{y_i}}$$where ${y_i}$ is the observed value of the outcome, $Y$, and $\bar{y_i}$ is its complement (e.g. $1$ for $0$ and $0$ for $1$). This form just determines which value will be used in the computation of the Likelihood function at each observed data point $y_i$: it will be either $p_1$ or $1-p_1$. The likelihood function for a given value of $p_1$ for the whole dataset is computed by multiplying the values of $\mathcal{L}(p|y_i)$ across the whole dataset (remember that multiplication in Probability is what conjunction is in Logic and Algebra).
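For the $HHTHTTHHHT$ sequence above (6 heads, 4 tails), the whole-dataset likelihood at a candidate $p_H$ reduces to $p_H^6(1-p_H)^4$; a quick numeric check (the function name is illustrative):

```python
# - HHTHTTHHHT: 6 heads, 4 tails
def coin_likelihood(p_h):
    return p_h**6 * (1 - p_h)**4

print(coin_likelihood(0.5))  # 0.0009765625
print(coin_likelihood(0.6))  # larger than at 0.5: p_H = 6/10 is the MLE
```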
Q: But... how do we get to $p_1$, the parameter value that we will use at each data point?
A: We will search the parameter space of the linear coefficients $\beta_0, \beta_1, \ldots, \beta_k$ in our Binomial Logistic Model, computing $p_1$ every time, and compute the likelihood function from it! In other words: we will search the parameter space to find the combination of $\beta_0, \beta_1, \ldots, \beta_k$ that produces the maximum of the likelihood function, just as we searched the space of linear coefficients to find the combination that minimizes the squared error in Simple Linear Regression.
So what combination of the linear coefficients is the best one?
It is the one which gives the Maximum Likelihood. This approach, known as Maximum Likelihood Estimation (MLE), stands behind many important statistical learning models and presents the cornerstone of Statistical Estimation Theory. It is contrasted with the Least Squares Estimation that we used earlier to estimate Simple and Multiple Linear Regression models.
Now, there is a technical problem with this approach. To obtain the likelihood for the whole dataset, one needs to multiply as many very small numbers as there are data points (each factor is smaller than $1$, and often much smaller). The product can underflow the smallest positive real number that digital computers can represent. The workaround is to use the logarithm of the likelihood instead, known as the Log-Likelihood ($LL$).
Thus, while the Likelihood function for the whole dataset would be
$$\mathcal{L}(p|Y) = \prod_{i=1}^{n}p_1^{y_i}(1-p_1)^{\bar{y_i}}$$the Log-Likelihood function would be:
$$LL(p|Y) = \sum_{i=1}^{n} y_i\log(p_1)+\bar{y_i}\log(1-p_1)$$And finally, here is how we solve the Binomial Logistic Regression problem:
Technically, in optimization we would not go exactly for the maximum of the Likelihood function, because we use $LL$ instead of $\mathcal{L}(p|Y)$. The solution is to minimize the negative $LL$, sometimes written simply as $NLL$, the Negative Log-Likelihood function.
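A minimal sketch of why the logarithm matters computationally: the direct product of many probabilities underflows to zero in double precision, while the sum of their logarithms remains perfectly representable.

```python
import numpy as np

# 100,000 Bernoulli factors, each equal to .5
p = np.full(100_000, .5)

# the direct product underflows the smallest positive float64 and becomes 0.0
print(np.prod(p))            # 0.0
# the log-likelihood, a sum of logs, is unproblematic: 100000 * log(.5)
ll = np.sum(np.log(p))
print(ll)
# the Negative Log-Likelihood is just its sign flip; that is what we minimize
nll = -ll
```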
model_frame = churn_data[['Churn', 'MonthlyCharges', 'TotalCharges']]
# - encoding 'Churn' values to binary values
model_frame['Churn'] = model_frame['Churn'].apply(lambda x: int(x == 'Yes'))
model_frame
| | Churn | MonthlyCharges | TotalCharges |
|---|---|---|---|
| 0 | 0 | 29.85 | 29.85 |
| 1 | 0 | 56.95 | 1889.50 |
| 2 | 1 | 53.85 | 108.15 |
| 3 | 0 | 42.30 | 1840.75 |
| 4 | 1 | 70.70 | 151.65 |
| ... | ... | ... | ... |
| 7027 | 0 | 84.80 | 1990.50 |
| 7028 | 0 | 103.20 | 7362.90 |
| 7029 | 0 | 29.60 | 346.45 |
| 7030 | 1 | 74.40 | 306.60 |
| 7031 | 0 | 105.65 | 6844.50 |
7032 rows × 3 columns
Implement the model predictive pass given the parameters; the following blr_predict() function is nothing else than the implementation of the following expression:
$$p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2)}}$$
def blr_predict(params, data):
    # - grab parameters
    beta_0 = params[0]
    beta_1 = params[1]
    beta_2 = params[2]
    # - predict: model function
    x1 = data["MonthlyCharges"]
    x2 = data["TotalCharges"]
    # - linear combination:
    lc = beta_0 + beta_1*x1 + beta_2*x2
    # - inverse logit (sigmoid)
    ep = np.exp(-lc)
    p = 1/(1 + ep)
    return p
Test blr_predict()
test_params = np.array([-2.7989, .0452, -.0006])
predictions = blr_predict(params=test_params, data=model_frame)
predictions
0 0.187309
1 0.204491
2 0.394181
3 0.120110
4 0.575848
...
7027 0.460025
7028 0.072292
7029 0.158578
7030 0.593878
7031 0.106194
Length: 7032, dtype: float64
Now define the Negative Log-Likelihood function:
from scipy.stats import binom
def blr_nll(params, data):
    # - predictive pass
    p = blr_predict(params, data)
    # - joint negative log-likelihood
    # - across all observations
    nll = -binom.logpmf(data["Churn"], 1, p).sum()
    return nll
Test blr_nll():
# - test blr_nll()
test_params = np.array([-2.7989, .0452, -.0006])
blr_nll(params=test_params, data=model_frame)
3284.57719221028
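Since the Binomial pmf with $n=1$ is just $p^{y}(1-p)^{1-y}$ (the binomial coefficient equals $1$), blr_nll() is equivalent to writing the $LL$ formula out by hand. A quick check on synthetic values (the seed and sizes below are arbitrary):

```python
import numpy as np
from scipy.stats import binom

rng_check = np.random.default_rng(42)
y_check = rng_check.integers(0, 2, size=100)      # synthetic 0/1 outcomes
p_check = rng_check.uniform(.05, .95, size=100)   # synthetic predicted probabilities

# NLL via scipy, as in blr_nll():
nll_scipy = -binom.logpmf(y_check, 1, p_check).sum()
# the same quantity from the LL formula directly:
nll_manual = -np.sum(y_check*np.log(p_check) + (1 - y_check)*np.log(1 - p_check))
print(np.isclose(nll_scipy, nll_manual))   # True
```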
Optimize!
from scipy.optimize import minimize
# - initial (random) parameter values, packed into a 1-D array
# - (minimize() expects x0 with x0.ndim == 1)
init_pars = np.array([
    rng.uniform(low=-3, high=0),       # beta_0
    rng.uniform(low=-.05, high=.05),   # beta_1
    rng.uniform(low=-.001, high=.001)  # beta_2
])
# - optimize w. Nelder-Mead
optimal_model = minimize(
    # - fun(parameters, args)
    fun=blr_nll,
    args=(model_frame,),
    x0=init_pars,
    method='Nelder-Mead',
    options={'maxiter': 1e6,
             'maxfev': 1e6,
             'fatol': 1e-6})
# - optimal parameters
for param in optimal_model.x:
    print(param)
-2.7988881731536672
0.045230622773866344
-0.000618151161753269
optimal_model.fun
3283.393916036912
Check against statsmodels
predictors = model_frame.columns.drop('Churn')
formula = ' + '.join(predictors)
formula = 'Churn ~ ' + formula
print(formula)
binomial_linear_model = smf.logit(formula=formula, data=model_frame).fit()
binomial_linear_model.summary()
Churn ~ MonthlyCharges + TotalCharges
Optimization terminated successfully.
Current function value: 0.466922
Iterations 6
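Note that statsmodels reports the average NLL per observation as the "Current function value"; multiplying it by the number of rows recovers (up to rounding) the total NLL of 3283.39 that we minimized above:

```python
# statsmodels optimizes the mean NLL; scale it back up by n to compare
n_obs = 7032
avg_nll = 0.466922      # "Current function value" from the fit log above
print(n_obs * avg_nll)  # ~3283.4, matching optimal_model.fun
```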
| Dep. Variable: | Churn | No. Observations: | 7032 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 7029 |
| Method: | MLE | Df Model: | 2 |
| Date: | Tue, 02 May 2023 | Pseudo R-squ.: | 0.1936 |
| Time: | 11:30:36 | Log-Likelihood: | -3283.4 |
| converged: | True | LL-Null: | -4071.7 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -2.7989 | 0.089 | -31.548 | 0.000 | -2.973 | -2.625 |
| MonthlyCharges | 0.0452 | 0.001 | 31.620 | 0.000 | 0.042 | 0.048 |
| TotalCharges | -0.0006 | 1.97e-05 | -31.373 | 0.000 | -0.001 | -0.001 |
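The fitted coefficients live on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret:

```python
import numpy as np

# coefficients from the summary above (log-odds scale)
coefs = {'Intercept': -2.7989, 'MonthlyCharges': .0452, 'TotalCharges': -.0006}
# exp(beta) = multiplicative change in the odds of churn per unit increase
odds_ratios = {name: np.exp(b) for name, b in coefs.items()}
print(odds_ratios)
# e.g. each additional unit of MonthlyCharges multiplies the churn odds by ~1.046
```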
Plot the Model Log-Likelihood Function
# - from statsmodels: beta_0 is -2.7989, beta_1 is .0452, and beta_2 is -.0006
beta_0 = -2.7989
beta_1_vals = np.linspace(-.05,.05,100)
beta_2_vals = np.linspace(-.001,.001,100)
grid = np.array([(beta_1, beta_2) for beta_1 in beta_1_vals for beta_2 in beta_2_vals])
grid = pd.DataFrame(grid)
grid = grid.rename(columns={0: "beta_1", 1: "beta_2"})
nll = []
for i in range(grid.shape[0]):
    pars = [beta_0, grid['beta_1'][i], grid['beta_2'][i]]
    # - note the sign flip: we store the log-likelihood (the negated NLL),
    # - which is why the values are negative and the maximum sits on top
    # - after the descending sort
    l = -blr_nll(pars, model_frame)
    nll.append(l)
grid['nll'] = nll
grid.sort_values('nll', ascending=False, inplace=True)
grid.head(100)
| | beta_1 | beta_2 | nll |
|---|---|---|---|
| 9419 | 0.044949 | -0.000616 | -3283.545964 |
| 9518 | 0.045960 | -0.000636 | -3283.967556 |
| 9420 | 0.044949 | -0.000596 | -3284.466526 |
| 9519 | 0.045960 | -0.000616 | -3285.218180 |
| 9320 | 0.043939 | -0.000596 | -3285.320667 |
| ... | ... | ... | ... |
| 9917 | 0.050000 | -0.000657 | -3324.443332 |
| 9027 | 0.040909 | -0.000455 | -3324.735600 |
| 9315 | 0.043939 | -0.000697 | -3324.857073 |
| 9523 | 0.045960 | -0.000535 | -3325.140288 |
| 9118 | 0.041919 | -0.000636 | -3325.271093 |
100 rows × 3 columns
# - import plotly
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
# - Mesh3d: Objective Function
fig = go.Figure(data=[go.Mesh3d(
    x=grid['beta_1'],
    y=grid['beta_2'],
    z=grid['nll'],
    color='red',
    opacity=0.50)])
fig.update_layout(scene=dict(
                      xaxis_title='Beta_1',
                      yaxis_title='Beta_2',
                      zaxis_title='LL'),
                  width=700,
                  margin=dict(r=20, b=10, l=10, t=10))
fig.show()
Next, we will use scikit-learn, which lets us regularize the BLR model directly.
### --- Preparing the variables
# - feature matrix
X = churn_data[['tenure', 'MonthlyCharges', 'TotalCharges']].values
# - target variable
y = churn_data['Churn'].apply(lambda x: int(x == 'Yes'))
It is recommended to standardize the feature matrix $X$ before using regularized logistic regression in scikit-learn, particularly if the features have different scales. Regularization techniques like $L1$ and $L2$ regularization are sensitive to the scale of the features, and features with larger scales may dominate the regularization penalty. Standardizing the features can help to mitigate this issue and ensure that all features contribute equally to the model.
To standardize the features in scikit-learn, we use the StandardScaler transformer from the sklearn.preprocessing module:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
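A minimal sketch of what StandardScaler does, on synthetic features with very different scales (the sizes and ranges below are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng_demo = np.random.default_rng(0)
# two synthetic features: one on a [0, 1] scale, one on a [0, 10000] scale
X_demo = np.column_stack([rng_demo.uniform(0, 1, 500),
                          rng_demo.uniform(0, 10_000, 500)])

X_std = StandardScaler().fit_transform(X_demo)
# after standardization, each column has mean ~0 and standard deviation ~1
print(X_std.mean(axis=0).round(6))
print(X_std.std(axis=0).round(6))
```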
Please make sure to study the LogisticRegression() documentation from Scikit-Learn: sklearn.linear_model.LogisticRegression:
`C` : *float, default=1.0*. Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
# Create an instance of the logistic regression model with L1 regularization
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression(penalty='l1',
                                    C=.15,
                                    solver='liblinear')
# Fit the model on the training data
logistic_model.fit(X, y)
# Print the model coefficients (the intercept and the coefficients of the predictors)
print("Model coefficients:", logistic_model.intercept_, logistic_model.coef_)
Model coefficients: [-1.43662005] [[-1.42088886 0.9443366 0.10335326]]
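A useful property of the $L1$ penalty is that it can drive coefficients to exactly zero, performing feature selection. A sketch on synthetic data (all names and settings below are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# synthetic data: 10 standardized features, only 3 of them informative
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

# stronger regularization (smaller C) should zero out more coefficients
for C in [.01, 1.0]:
    model = LogisticRegression(penalty='l1', C=C,
                               solver='liblinear').fit(X_demo, y_demo)
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"C = {C}: {n_zero} of 10 coefficients are exactly zero")
```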
Examine a range of C inverse regularization penalty:
# define the range of C (inverse penalty) values to try
C_range = [0.001, 0.01, 0.1, 1, 10, 100]
# create empty list to store the score for each C value
scores = []
# iterate over the C values and compute the in-sample score for each
for C in C_range:
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear').fit(X, y)
    sc = model.score(X, y)
    scores.append(sc)
    # print the score for each C value
    print(f"C = {C:.3f}: Score = {sc:.3f}")
# find the best C value based on the mean score
best_C = C_range[scores.index(max(scores))]
print(f"Best C = {best_C:.3f}")
C = 0.001: Score = 0.734
C = 0.010: Score = 0.789
C = 0.100: Score = 0.785
C = 1.000: Score = 0.785
C = 10.000: Score = 0.785
C = 100.000: Score = 0.785
Best C = 0.010
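Note that model.score(X, y) above evaluates the model on the very data it was fitted on. A more honest estimate comes from cross-validation; a sketch of the same loop with cross_val_score, on synthetic data so it runs standalone (the pattern transfers directly to our X and y):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for (X, y)
X_demo, y_demo = make_classification(n_samples=500, random_state=0)

for C in [.001, .01, .1, 1, 10, 100]:
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    # 5-fold cross-validated accuracy instead of the in-sample score
    cv_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    print(f"C = {C}: mean CV accuracy = {cv_scores.mean():.3f}")
```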
# Create an instance of the logistic regression model with L2 regularization
logistic_model = LogisticRegression(penalty='l2',
                                    C=.15,
                                    solver='liblinear')
# Fit the model on the training data
logistic_model.fit(X, y)
# Print the model coefficients (the intercept and the coefficients of the predictors)
print("Model coefficients:", logistic_model.intercept_, logistic_model.coef_)
Model coefficients: [-1.4361844] [[-1.4546272 0.929324 0.14909584]]
Examine a range of C inverse regularization penalty:
# define the range of C (inverse penalty) values to try
C_range = [0.001, 0.01, 0.1, 1, 10, 100]
# create empty list to store the score for each C value
scores = []
# iterate over the C values and compute the in-sample score for each
for C in C_range:
    model = LogisticRegression(penalty='l2', C=C, solver='liblinear').fit(X, y)
    sc = model.score(X, y)
    scores.append(sc)
    # print the score for each C value
    print(f"C = {C:.3f}: Score = {sc:.3f}")
# find the best C value based on the mean score
best_C = C_range[scores.index(max(scores))]
print(f"Best C = {best_C:.3f}")
C = 0.001: Score = 0.789
C = 0.010: Score = 0.787
C = 0.100: Score = 0.785
C = 1.000: Score = 0.785
C = 10.000: Score = 0.785
C = 100.000: Score = 0.785
Best C = 0.001
`l1_ratio` : *float, default=None*. The Elastic-Net mixing parameter, with `0 <= l1_ratio <= 1`. Only used if `penalty='elasticnet'`. Setting `l1_ratio=0` is equivalent to using `penalty='l2'`, while setting `l1_ratio=1` is equivalent to using `penalty='l1'`. For `0 < l1_ratio < 1`, the penalty is a combination of L1 and L2.
# Create an instance of the logistic regression model with Elastic-Net regularization
logistic_model = LogisticRegression(penalty='elasticnet',
                                    l1_ratio=.5,
                                    C=.15,
                                    solver='saga')
# Fit the model on the training data
logistic_model.fit(X, y)
# Print the model coefficients (the intercept and the coefficients of the predictors)
print("Model coefficients:", logistic_model.intercept_, logistic_model.coef_)
Model coefficients: [-1.45021784] [[-1.46190326 0.93678774 0.14573006]]
# The same Elastic-Net model, with max_iter increased
# (the saga solver may need more iterations to converge)
logistic_model = LogisticRegression(penalty='elasticnet',
                                    l1_ratio=.5,
                                    C=.15,
                                    max_iter=10000,
                                    solver='saga')
# Fit the model on the training data
logistic_model.fit(X, y)
# Print the model coefficients (the intercept and the coefficients of the predictors)
print("Model coefficients:", logistic_model.intercept_, logistic_model.coef_)
Model coefficients: [-1.45022182] [[-1.46186556 0.93679974 0.14570234]]
Now, using F1 to score the model:
from sklearn.metrics import f1_score
# define the range of C (inverse penalty) values to try
C_range = [0.001, 0.01, 0.1, 1, 10, 100]
# define the range of l1_ratio (ElasticNet mix parameter) values to try
l1_range = [0, .25, .5, .75, 1]
# create empty list to store the score for each C value
scores = []
cs = []
l1_ratios = []
f1s = []
# iterate over the (C, l1_ratio) grid and score each model in-sample
for C in C_range:
    for l1 in l1_range:
        model = LogisticRegression(penalty='elasticnet',
                                   C=C,
                                   l1_ratio=l1,
                                   max_iter=10000,
                                   solver='saga').fit(X, y)
        sc = model.score(X, y)
        scores.append(sc)
        # store C, l1_ratio
        cs.append(C)
        l1_ratios.append(l1)
        # predict the outcome on the training data
        y_pred = model.predict(X)
        # calculate the F1 score
        f1 = f1_score(y, y_pred)
        f1s.append(f1)
        # print the current C, l1_ratio, accuracy, and F1
        print(f"C = {C:.3f}, l1_ratio = {l1:.3f}: Accuracy = {sc:.7f}, F1 = {f1:.7f}")
# find the best C, l1_ratio values based on F1
best_ix = f1s.index(max(f1s))
best_C = cs[best_ix]
print(f"Best C = {best_C:.3f}")
best_l1_ratio = l1_ratios[best_ix]
print(f"Best l1_ratio = {best_l1_ratio:.3f}")
best_f1 = f1s[best_ix]
print(f"Best F1 = {best_f1:.7f}")
best_acc = scores[best_ix]
print(f"Best Accuracy = {best_acc:.7f}")
C = 0.001, l1_ratio = 0.000: Accuracy = 0.7417520, F1 = 0.0677618
C = 0.001, l1_ratio = 0.250: Accuracy = 0.7342150, F1 = 0.0000000
C = 0.001, l1_ratio = 0.500: Accuracy = 0.7342150, F1 = 0.0000000
C = 0.001, l1_ratio = 0.750: Accuracy = 0.7342150, F1 = 0.0000000
C = 0.001, l1_ratio = 1.000: Accuracy = 0.7342150, F1 = 0.0000000
C = 0.010, l1_ratio = 0.000: Accuracy = 0.7889647, F1 = 0.5089345
C = 0.010, l1_ratio = 0.250: Accuracy = 0.7891069, F1 = 0.5061605
C = 0.010, l1_ratio = 0.500: Accuracy = 0.7901024, F1 = 0.5050302
C = 0.010, l1_ratio = 0.750: Accuracy = 0.7899602, F1 = 0.5045287
C = 0.010, l1_ratio = 1.000: Accuracy = 0.7902446, F1 = 0.5051996
C = 0.100, l1_ratio = 0.000: Accuracy = 0.7851251, F1 = 0.5204697
C = 0.100, l1_ratio = 0.250: Accuracy = 0.7852673, F1 = 0.5209391
C = 0.100, l1_ratio = 0.500: Accuracy = 0.7854096, F1 = 0.5217116
C = 0.100, l1_ratio = 0.750: Accuracy = 0.7854096, F1 = 0.5223172
C = 0.100, l1_ratio = 1.000: Accuracy = 0.7852673, F1 = 0.5221519
C = 1.000, l1_ratio = 0.000: Accuracy = 0.7848407, F1 = 0.5219589
C = 1.000, l1_ratio = 0.250: Accuracy = 0.7848407, F1 = 0.5219589
C = 1.000, l1_ratio = 0.500: Accuracy = 0.7848407, F1 = 0.5219589
C = 1.000, l1_ratio = 0.750: Accuracy = 0.7848407, F1 = 0.5219589
C = 1.000, l1_ratio = 1.000: Accuracy = 0.7848407, F1 = 0.5219589
C = 10.000, l1_ratio = 0.000: Accuracy = 0.7848407, F1 = 0.5222608
C = 10.000, l1_ratio = 0.250: Accuracy = 0.7848407, F1 = 0.5222608
C = 10.000, l1_ratio = 0.500: Accuracy = 0.7848407, F1 = 0.5222608
C = 10.000, l1_ratio = 0.750: Accuracy = 0.7848407, F1 = 0.5222608
C = 10.000, l1_ratio = 1.000: Accuracy = 0.7848407, F1 = 0.5222608
C = 100.000, l1_ratio = 0.000: Accuracy = 0.7849829, F1 = 0.5227273
C = 100.000, l1_ratio = 0.250: Accuracy = 0.7849829, F1 = 0.5227273
C = 100.000, l1_ratio = 0.500: Accuracy = 0.7848407, F1 = 0.5222608
C = 100.000, l1_ratio = 0.750: Accuracy = 0.7849829, F1 = 0.5224258
C = 100.000, l1_ratio = 1.000: Accuracy = 0.7849829, F1 = 0.5224258
Best C = 100.000
Best l1_ratio = 0.000
Best F1 = 0.5227273
Best Accuracy = 0.7849829
NOTE the following:
y.value_counts()
0    5163
1    1869
Name: Churn, dtype: int64
Oops. A class imbalance problem? Use the class_weight argument to LogisticRegression():
`class_weight` : *dict or 'balanced', default=None*. Weights associated with classes in the form `{class_label: weight}`. If not given, all classes are supposed to have weight one. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as `n_samples / (n_classes * np.bincount(y))`.
n_samples = X.shape[0]
n_classes = 2
print(np.bincount(y))
n_samples/(n_classes * np.bincount(y))
[5163 1869]
array([0.68099942, 1.8812199 ])
from sklearn.metrics import f1_score
# define the range of C (inverse penalty) values to try
C_range = [0.001, 0.01, 0.1, 1, 10, 100]
# define the range of l1_ratio (ElasticNet mix parameter) values to try
l1_range = [0, .25, .5, .75, 1]
# create empty list to store the score for each C value
scores = []
cs = []
l1_ratios = []
f1s = []
# iterate over the (C, l1_ratio) grid and score each model in-sample
for C in C_range:
    for l1 in l1_range:
        model = LogisticRegression(penalty='elasticnet',
                                   C=C,
                                   l1_ratio=l1,
                                   max_iter=10000,
                                   class_weight='balanced',
                                   solver='saga').fit(X, y)
        sc = model.score(X, y)
        scores.append(sc)
        # store C, l1_ratio
        cs.append(C)
        l1_ratios.append(l1)
        # predict the outcome on the training data
        y_pred = model.predict(X)
        # calculate the F1 score
        f1 = f1_score(y, y_pred)
        f1s.append(f1)
        # print the current C, l1_ratio, accuracy, and F1
        print(f"C = {C:.3f}, l1_ratio = {l1:.3f}: Accuracy = {sc:.7f}, F1 = {f1:.7f}")
# find the best C, l1_ratio values based on F1
best_ix = f1s.index(max(f1s))
best_C = cs[best_ix]
print(f"Best C = {best_C:.3f}")
best_l1_ratio = l1_ratios[best_ix]
print(f"Best l1_ratio = {best_l1_ratio:.3f}")
best_f1 = f1s[best_ix]
print(f"Best F1 = {best_f1:.7f}")
best_acc = scores[best_ix]
print(f"Best Accuracy = {best_acc:.7f}")
C = 0.001, l1_ratio = 0.000: Accuracy = 0.7197099, F1 = 0.5909940
C = 0.001, l1_ratio = 0.250: Accuracy = 0.7040671, F1 = 0.5830495
C = 0.001, l1_ratio = 0.500: Accuracy = 0.6868601, F1 = 0.5689115
C = 0.001, l1_ratio = 0.750: Accuracy = 0.6616894, F1 = 0.5455587
C = 0.001, l1_ratio = 1.000: Accuracy = 0.6396473, F1 = 0.5217063
C = 0.010, l1_ratio = 0.000: Accuracy = 0.7251138, F1 = 0.5840327
C = 0.010, l1_ratio = 0.250: Accuracy = 0.7238339, F1 = 0.5839760
C = 0.010, l1_ratio = 0.500: Accuracy = 0.7245449, F1 = 0.5881352
C = 0.010, l1_ratio = 0.750: Accuracy = 0.7245449, F1 = 0.5883103
C = 0.010, l1_ratio = 1.000: Accuracy = 0.7238339, F1 = 0.5875106
C = 0.100, l1_ratio = 0.000: Accuracy = 0.7236917, F1 = 0.5854491
C = 0.100, l1_ratio = 0.250: Accuracy = 0.7238339, F1 = 0.5853971
C = 0.100, l1_ratio = 0.500: Accuracy = 0.7238339, F1 = 0.5848653
C = 0.100, l1_ratio = 0.750: Accuracy = 0.7235495, F1 = 0.5837259
C = 0.100, l1_ratio = 1.000: Accuracy = 0.7235495, F1 = 0.5828326
C = 1.000, l1_ratio = 0.000: Accuracy = 0.7246871, F1 = 0.5875586
C = 1.000, l1_ratio = 0.250: Accuracy = 0.7246871, F1 = 0.5875586
C = 1.000, l1_ratio = 0.500: Accuracy = 0.7246871, F1 = 0.5875586
C = 1.000, l1_ratio = 0.750: Accuracy = 0.7245449, F1 = 0.5872576
C = 1.000, l1_ratio = 1.000: Accuracy = 0.7245449, F1 = 0.5872576
C = 10.000, l1_ratio = 0.000: Accuracy = 0.7241183, F1 = 0.5870583
C = 10.000, l1_ratio = 0.250: Accuracy = 0.7239761, F1 = 0.5867575
C = 10.000, l1_ratio = 0.500: Accuracy = 0.7239761, F1 = 0.5867575
C = 10.000, l1_ratio = 0.750: Accuracy = 0.7239761, F1 = 0.5867575
C = 10.000, l1_ratio = 1.000: Accuracy = 0.7239761, F1 = 0.5867575
C = 100.000, l1_ratio = 0.000: Accuracy = 0.7241183, F1 = 0.5870583
C = 100.000, l1_ratio = 0.250: Accuracy = 0.7241183, F1 = 0.5870583
C = 100.000, l1_ratio = 0.500: Accuracy = 0.7241183, F1 = 0.5870583
C = 100.000, l1_ratio = 0.750: Accuracy = 0.7241183, F1 = 0.5870583
C = 100.000, l1_ratio = 1.000: Accuracy = 0.7241183, F1 = 0.5870583
Best C = 0.001
Best l1_ratio = 0.000
Best F1 = 0.5909940
Best Accuracy = 0.7197099
The class imbalance problem can severely affect your ROC analysis of the model; in particular, Accuracy makes little sense as an indicator of model fit when class imbalance is present in the data.
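To see why, consider a degenerate classifier that always predicts the majority class on data with the same class counts as ours (a sketch; the counts are taken from y.value_counts() above):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ground truth with the same class counts as our data: 5163 zeros, 1869 ones
y_true = np.array([0]*5163 + [1]*1869)
# a useless "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Accuracy looks respectable...
print(accuracy_score(y_true, y_pred))             # ~0.734
# ...but F1 exposes the model: it never identifies a single churner
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0
```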
License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.